$F_a \leftarrow$ set of features present in $s = [s_1, \ldots, s_n]$ and $a = [a_1, \ldots, a_m]$

$a \leftarrow \arg\max_a Q_a$ **

with probability $\varepsilon = 1/t$: $a \leftarrow$ random action $\in A(s)$

Repeat for each step of the episode: 

Take action $a$

Observe r (the reward) 

Observe s' (the next state) 

For all $a' \in A(s')$

$F_{a'} \leftarrow$ set of features present in $s' = [s'_1, \ldots, s'_n]$ and $a' = [a'_1, \ldots, a'_m]$
$a' \leftarrow \arg\max_{a'} Q_{a'}$ **

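with probability $\varepsilon = 1/t$: $a' \leftarrow$ random action $\in A(s')$
$\theta \leftarrow \theta + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right] \nabla_\theta Q(s, a, \theta)$,
where $Q(s', a') = Q_{a'}$ and $Q(s, a) = Q_a$
$a \leftarrow a'$

$t \leftarrow t + 1$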
* For each action-state pair there will be an estimate of the value of the pair, based on the sum of the values in $\theta$.
** Find the highest state-action value for this state and choose the corresponding action.
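The gradient term in the parameter update can be made explicit under one reading of footnote *. Assuming that each feature is either present or absent, and that the estimate for a pair is the sum of the parameters of the features present (the footnote suggests this form but does not spell it out), the gradient is an indicator vector, and the update only touches the parameters of the features in $F_a$:

```latex
\[
Q(s, a, \theta) = \sum_{i \in F_a} \theta_i ,
\qquad
\frac{\partial Q(s, a, \theta)}{\partial \theta_i} =
\begin{cases}
1 & \text{if } i \in F_a \\
0 & \text{otherwise}
\end{cases}
\]
\[
\theta_i \leftarrow \theta_i + \alpha \left[ r + \gamma Q_{a'} - Q_a \right]
\quad \text{for all } i \in F_a
\]
```

Parameters of features that are absent from the current state-action pair are left unchanged by that step.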
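The procedure above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the corridor environment, the `features` map (one binary feature per state-action pair), random tie-breaking, the step cap, treating the value of the terminal state as 0, and restarting `t` at 1 in every episode are all assumptions made to keep the example small and runnable.

```python
import random

def q_value(theta, feats):
    """Q(s, a): sum of the parameters of the features present (footnote *)."""
    return sum(theta[i] for i in feats)

def sarsa_update(theta, feats, r, q_next, alpha, gamma):
    """theta <- theta + alpha * [r + gamma*Q(s',a') - Q(s,a)] * grad Q(s,a,theta)."""
    delta = r + gamma * q_next - q_value(theta, feats)
    for i in feats:  # gradient is 1 for present features, 0 for all others
        theta[i] += alpha * delta
    return delta

def features(state, action, n_states):
    """Hypothetical feature map: one binary feature per (state, action) pair."""
    return [action * n_states + state]

def choose_action(theta, state, actions, n_states, t, rng):
    """Greedy action (footnote **), replaced by a random one with probability 1/t."""
    if rng.random() < 1.0 / t:
        return rng.choice(actions)
    qs = [q_value(theta, features(state, a, n_states)) for a in actions]
    best = max(qs)
    # Assumption: ties between equally valued actions are broken at random.
    return rng.choice([a for a, q in zip(actions, qs) if q == best])

def run_episode(theta, n_states, alpha, gamma, rng, max_steps=1000):
    """One episode on a toy corridor: action 0 moves left, action 1 moves right."""
    actions = [0, 1]
    t = 1  # assumption: t restarts at 1 in every episode
    s = 0
    a = choose_action(theta, s, actions, n_states, t, rng)
    for _ in range(max_steps):
        # Take action a, observe r and s'.
        s_next = max(0, s - 1) if a == 0 else s + 1
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0
        feats = features(s, a, n_states)
        if done:
            # Assumption: the value of the terminal state is 0.
            sarsa_update(theta, feats, r, 0.0, alpha, gamma)
            return True
        a_next = choose_action(theta, s_next, actions, n_states, t, rng)
        q_next = q_value(theta, features(s_next, a_next, n_states))
        sarsa_update(theta, feats, r, q_next, alpha, gamma)
        s, a = s_next, a_next
        t += 1
    return False

def train(n_states=5, episodes=200, alpha=0.1, gamma=0.9, seed=0):
    """Run several episodes and return the learned parameter vector theta."""
    rng = random.Random(seed)
    theta = [0.0] * (2 * n_states)
    for _ in range(episodes):
        run_episode(theta, n_states, alpha, gamma, rng)
    return theta
```

Calling `train()` returns the parameter vector; `q_value(theta, features(3, 1, 5))` then gives the estimate for moving right from the state next to the goal. A richer feature map, with several features present per pair, can be substituted for `features` without changing `sarsa_update`.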